ABSTRACT

In the first of three experiments, university
undergraduates were presented a list of 300 words and 100 nonwords in
two sessions. Their confidence that an item was a word was indicated
for each item on a six-point scale. This experiment demonstrated the
feasibility of creating a recognition test of vocabulary. In
Experiment 2, 100 items were chosen to form a subtest, and the
subtest was cross validated on a new sample of subjects. The tests in
Experiments 1 and 2 were scored using signal-detection measures. The
primary criterion, SAT (verbal) scores, correlated approximately .60
with the test scores. In Experiment 3 subjects scaled the words and
nonwords for four psychological attributes. These were submitted to a
stepwise regression analysis with the confidence ratings from
Experiment 1 as the dependent variable. It was concluded that
associability, frequency, orthography, and pronounceability all may
be components of word recognition. However, only frequency was found
to be a significant predictor of the confidence of recognition of
nonwords. (Author)

A RECOGNITION TEST OF VOCABULARY USING SIGNAL-DETECTION MEASURES,
AND SOME CORRELATES OF WORD AND NONWORD RECOGNITION¹
Joel Zimmerman, Paul K. Broder,
John J. Shaughnessy, and Benton J. Underwood
Northwestern University 

In some recent experiments (e.g., Carroll, 1971) subjects have been
presented a set of words and asked to judge the frequencies with which
such words occur in printed English. This procedure rests on the
assumption that subjects keep some kind of "tally" in their day to day
interaction with printed English. It also assumes, however, that subjects
have truly seen the words which are being tested, an assumption which may
not have a basis in truth for very low frequency words.

Assume there was a sample of words which were distributed along a
continuum according to their true frequencies of occurrence in printed
English, from very frequent to very infrequent. For any hypothetical
subject, this sample of words could be thought of as a monotonically
decreasing scale of word familiarity. Words at the beginning of the
continuum would surely be recognized. There would be some point on this
continuum, though, such that all words past such a point would never
before have been seen in printed English by this subject. In judging
frequency, the subject ought to rate each word past this point a "zero,"
and we would say at this point that the subject now fails to recognize
any further words.

The concern of the present study is not with the ability to judge
word frequency, but with the matter of recognition. At least two lines of
investigation become apparent. The first of these concerns individual
differences in recognition, and the second concerns the psychological
processes involved in recognition.

¹The authors wish to thank Dr. Robert Sekuler for his advice regarding
signal detection theory.

Considering, again, the continuum of words with differing background
frequencies, it would be reasonable to suppose that the point on this
continuum at which a subject would no longer recognize words would be
quite different from subject to subject. One might suppose that as a
subject's vocabulary increased, he would be able to proceed farther and
farther down the continuum before reaching the hypothetical point. This
is to say that subjects with better vocabularies should be able to
recognize more words, an intuitively appealing statement which would
probably be accepted by most people without any prefatory rationale.

One purpose of this experiment was to determine whether a test
might be developed to assess vocabulary skills based on this principle,
i.e., that people with better vocabularies will recognize more words as
words. Subjects were thus shown a series of words and asked whether or
not they recognized each one. To make the task realistic, a fourth of
the items presented were nonwords, which the subject was not expected
to recognize as words. The subject expressed his confidence in each
recognition judgment through the use of a category rating scale. Word
recognition ability was assessed using a measure derived from signal
detection theory. This measure served as an estimate of the subject's
ability which theoretically would be free from the effect of a subject's
bias in using the rating scale.

The second line of interest in the experiment was to inquire into the
psychology of word recognition. On what basis does a subject decide that
he recognizes a word? Further consideration will be given to this problem
in the introduction to Experiment III.

Experiment I 

Method 

Materials. The complete test form was composed of 400 stimuli, of
which 300 were words and 100 were nonwords. A representative sample of
400 words was drawn from a standard English language dictionary (G. & C.
Merriam Co., 1963). No compound words, hyphenated words, or homographs
were allowed. Words had to be at least four, and not more than 10, letters
in length. From this pool, 100 words were selected randomly and clustered
by number of syllables. Corresponding syllables of words within these
clusters were then interchanged in a random manner to produce nonwords.
One-syllable words were arbitrarily divided into two parts and these
parts were interchanged. Nonwords which resulted in combinations of
syllables which, in the opinion of the first author, were extremely
difficult to pronounce or which resulted in true words were subjected to
a second or third random interchange of syllables with other such items.
After three such interchanges, about five nonwords were still extremely
difficult to pronounce, and minimal changes were made in the letters of
these items to make them pronounceable. By this method, 100 nonwords
were created which had about the same average length, number of syllables,
and letter frequency as did the 100 real words from which they came.
An unabridged dictionary (Stein & Urdang, 1967) was then checked to affirm
that these items were not, in fact, words.

The 300 remaining words and the 100 nonwords were randomly placed
into 10 groups of 40 with no restrictions. The 40 items in each group
appeared in lower-case type in two columns of 20 items each on plain 8 1/2
X 11 in. paper. Placement of an item within a column was random. At the
top of each page was space for the subject's name and the date. There was
also an explanation of the six-point scale which the subject was to use,
as follows:

1 means you are very sure this is not a word

2 means you think this is not a word 

3 means you guess this is probably not a word

4 means you guess this is probably a word 

5 means you think this is a word 

6 means you are very sure this is a word 

Following each word on the page were the numbers from 1 to 6 in a row. 

A test booklet contained each of the 10 pages of 40 items. The pages
were arranged according to a 10 X 10 Latin square to assure that each
page would be viewed in each position an equal number of times across a
group of 10 subjects. Since subjects were to judge only five pages per
day, each group of 10 test booklets constructed from a given Latin square
was duplicated except that the first five pages and the last five pages
were interchanged. Ten Latin squares were selected randomly, allowing
for the construction of 200 test booklets.
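
The booklet arrangement just described can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the report does not say how the ten Latin squares were drawn, so the sketch assumes each one is a cyclic square whose rows, columns, and page labels have been shuffled. The function names and the seed are hypothetical.

```python
import random


def random_latin_square(n, rng):
    """Return an n x n Latin square made by shuffling a cyclic square.

    Assumption: any randomly chosen Latin square serves the purpose; the
    report does not describe the method of selection.
    """
    rows = list(range(n))
    cols = list(range(n))
    pages = list(range(1, n + 1))
    rng.shuffle(rows)
    rng.shuffle(cols)
    rng.shuffle(pages)
    # Entry (r, c) of a cyclic square is (r + c) mod n; relabel it as a page.
    return [[pages[(r + c) % n] for c in cols] for r in rows]


def swap_halves(page_order):
    """Interchange the first and last halves of a booklet's page order."""
    half = len(page_order) // 2
    return page_order[half:] + page_order[:half]


def build_booklets(n_squares=10, n_pages=10, seed=0):
    """Each square yields n_pages booklets plus n_pages half-swapped copies."""
    rng = random.Random(seed)
    booklets = []
    for _ in range(n_squares):
        square = random_latin_square(n_pages, rng)
        booklets.extend(square)
        booklets.extend(swap_halves(order) for order in square)
    return booklets


booklets = build_booklets()
print(len(booklets))   # number of booklets constructed
print(booklets[0])     # page order of the first booklet
```

Within every group of 10 booklets, each page occupies each position exactly once, and interchanging the halves preserves that property, since it only moves whole columns of the square.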

Procedure. Subjects were 200 Northwestern University undergraduates
who served in this experiment either voluntarily or to fulfill a course
requirement. Each subject served on two successive days, judging 200 items
in five pages of a test booklet during each session. Instructions were
read which stated that this was an attempt to gather information on word
familiarity prior to using these items in constructing a vocabulary test.
Subjects were told that some of the items were not really words, but that
most were real words. The rating scale was explained, and the subjects
were instructed to read through the items in order, circling one of the
six numbers next to each item to indicate their certainty that a given
item was or was not a word. The instructions made it clear that a word
was not to be doubted on the basis of spelling, i.e., all items were to
be considered as correctly spelled. Subjects were asked to indicate their
high school rank at graduation, and their scores on the verbal section of
the Scholastic Aptitude Test.

Subjects were tested in groups varying in size from 1 to about 25,
and at times convenient to them. The second session for a subject occurred
from 18 hours to 30 hours after the first. For most subjects, the period
was 24 hours. One subject failed to return within these time bounds; his
first day's data were discarded, and an identical test booklet was
constructed and given to the next subject. During analysis, two subjects
were found to have skipped a page, and their data were discarded and
replaced. One subject was found to have clearly and consistently reversed
the six-point scale. This subject's data were corrected and retained.

One word, "sabadilla," was misspelled and printed as "sabadilia" on
the test sheets. The word was scored as though it had been printed
correctly, and scores for this item were retained in all analyses.

Results

Scaling results. The 100 nonwords and the 300 words are listed
alphabetically in the Appendix. Following each item is the mean judgment
given the item and the standard deviation of the judgments. Also included
in this table is a measure of internal consistency, which is the
correlation between the judgment made on this item and the mean judgment
made on all such items (nonwords or words) by each subject.

The mean of all judgments given to nonwords was 2.92 and the standard
deviation was 1.30. The mean of all judgments given to words was 4.90
with a standard deviation of 1.53. Thus, there was slightly more
variability overall in the judgments made on words than on nonwords. If
the measure taken is the mean rating given to nonwords and words by each
subject, however, this conclusion is modified. The mean ratings given to
100 nonwords by subjects varied from 1.16 to 4.75, a range of 3.59, and
the standard deviation of these means was .70. The mean ratings given
to words varied from 4.18 to 5.78 over subjects, a range of only 1.60,
and the standard deviation of these means was only .32. The reason for
these results seems to be that each subject was more consistent in his
judgment of nonwords than in his judgment of words, but that different
subjects were more variable in choosing a portion of the scale within which
they chose to rate nonwords.

Prediction of criteria. On the basis of their availability, two
criteria were chosen to be used in the validation of this test: high
school rank at graduation (HSR) and the verbal score of the Scholastic
Aptitude Test (SAT). HSR was scored as the percent of the class not
ranking as high as the subject. SAT scores are nationally normed scores
ranging from 200 to 800. We assumed that the HSR scores would represent
a measure of some sort of general ability, while the SAT scores would
represent some measure of ability more specifically verbal in nature. SAT
scores were, therefore, the more important of the two criteria. For this
reason, only prediction of SAT scores will be discussed at length in this
report. Some of the correlations with HSR are listed, however, in Table 1.

Subjects were requested to report their HSR and SAT scores at the
time of the test. Eight subjects failed to report each of the scores,
and to facilitate analysis, these subjects were assigned the nearest integer
to the mean of the scores of the 192 subjects from whom scores had been
obtained. This resulted in a mean reported SAT score of 610.40 with a
standard deviation of 79.45 over the 200 subjects. HSR scores had a
mean of 89.16 and a standard deviation of 10.94.

On the most unassuming level of analysis, one might suppose that
those subjects who had the highest vocabulary skills would be those who
most confidently recognized words as words, and nonwords as nonwords.
This would lead one to expect a positive correlation between the criterion
and the mean rating given to words, and a negative correlation between
the criterion and the mean rating given to nonwords. The correlations
between SAT scores and the mean judgments of words and nonwords were .20
and -.13, respectively. Only the prediction of SAT scores from word rating
means was significantly different from zero (t=[illegible], df=198, p<.01).

Such an attempt at prediction was, of course, naive. For one, it
assumed that subjects with higher vocabulary skills would give both higher
mean ratings to words and lower mean ratings to nonwords. This would
predict a negative correlation between mean ratings given to words and
nonwords. The computed correlation between these measures, however, was .76.
At least part of this correlation must reflect the subjects' biases in
using high or low numbers on the scale independent of the nature of the
particular stimulus being judged.

A measure which could overcome some of this bias would be the difference
between the mean judgment for nonwords and the mean judgment for words for
each subject. With increased confidence in both word and nonword
recognition, the difference between the word and nonword judgment means for
a subject should have increased, and ideally this should have been
independent of the subject's bias in using some part of the scale. In fact,
using this measure, prediction of SAT scores increased slightly to .31.

This measure is still deficient, however. It does not take into
account the variability with which a subject made judgments. A given
difference between word and nonword means increases in significance as
the variability around those means decreases. Therefore, a way to increase
the usefulness of the difference measure should be to standardize it with
respect to the subject's variability in making judgments. To do this, the
difference between word and nonword mean judgments for each subject was
divided by the square root of the pooled variance of that subject's
judgments around those means. When this was done, the correlation between
this measure (which will hereafter be referred to as d_s, for standardized
difference) and SAT scores was found to be .48. This increase in prediction
over .31, obtained by accounting for the subjects' variabilities in
judgments, was statistically significant (t=2.23, df=197, p<.05).
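
The standardized difference can be illustrated with a short Python sketch. It assumes that "pooled variance" means the usual degrees-of-freedom-weighted average of the word and nonword rating variances for one subject; the report does not spell out the formula, and the function name and example ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(word_ratings, nonword_ratings):
    """d_s for one subject: difference between the mean word rating and the
    mean nonword rating, divided by the square root of the pooled variance.

    Assumption: the two sample variances are pooled with weights
    (n - 1), as in a two-sample t test.
    """
    n_w, n_n = len(word_ratings), len(nonword_ratings)
    pooled_variance = (
        (n_w - 1) * variance(word_ratings)
        + (n_n - 1) * variance(nonword_ratings)
    ) / (n_w + n_n - 2)
    return (mean(word_ratings) - mean(nonword_ratings)) / sqrt(pooled_variance)


# Two hypothetical subjects who discriminate equally well but favor
# different parts of the six-point scale.
high_user = standardized_difference([5, 6, 5, 6], [2, 3, 2, 3])
low_user = standardized_difference([4, 5, 4, 5], [1, 2, 1, 2])
print(round(high_user, 3), round(low_user, 3))
```

Adding a constant to all of a subject's ratings leaves the value unchanged, which is the sense in which the measure is meant to be free of scale-usage bias.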

Signal detection theory. Signal detection theory was first applied to
verbal materials by Egan (1958), and is probably most thoroughly explicated
by Green and Swets (1966). Reviews of the use of signal detection theory
in memory experiments have been provided by Banks (1970) and Lockhart and
Murdock (1970). The purpose of a signal-detection analysis is to separate
two components of a subject's behavior: his sensitivity in responding to
a stimulus, and his bias in responding. As indicated by Banks, bias in
responding in a verbal recognition task is often related to the idea of
guessing, and analysis by signal-detection measures becomes, at very least,
a sophisticated way of correcting for guessing. The application of signal
detection theory to the present experiment is relatively straightforward.
It is assumed that for the population of words and nonwords, the degree
of a subject's confidence of recognition would be normally distributed
around some mean for words and some mean for nonwords. The distributions
are assumed to be normal and of equal variances.

As a test of whether the present data have met the assumptions of
signal detection theory, a memory operating characteristic (MOC) curve
has been plotted for the group as a whole, and is presented as the top
line in Figure 1 (ignore, for now, the other data in the figure). The
hit rates and false alarm rates have been transformed to standard unit
normal distribution equivalents. According to Lockhart and Murdock, a
straight line in a MOC curve is commonly taken as evidence that the
assumption of normal distributions has been met, but since large deviations
from normality will produce curves which appear to be straight lines, this
is not a critical test. Nonetheless, the straight line best fit (by the
method of least squares) for these data is quite good, r=.998. A line with a
slope of 1.0 is taken to be evidence for the assumption of equal variances.
It has already been stated, however, that the variances for nonword and
word judgments were not equal, and this is reflected by the fact that the
line has a slope of .55.
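
The following Python sketch shows one way such a check could be carried out on rating data. It is illustrative only: it assumes that a "hit" is a word rated at or above a criterion and a "false alarm" is a nonword rated at or above the same criterion, that the five criteria are the boundaries of the six-point scale, and that the z-transformed hit rates are regressed on the z-transformed false alarm rates. None of these details, nor the function names, are taken from the report.

```python
from statistics import NormalDist, mean


def moc_points(word_ratings, nonword_ratings, scale_max=6):
    """Cumulative (false-alarm rate, hit rate) pairs, strictest criterion first."""
    points = []
    for criterion in range(scale_max, 1, -1):
        hit = sum(r >= criterion for r in word_ratings) / len(word_ratings)
        fa = sum(r >= criterion for r in nonword_ratings) / len(nonword_ratings)
        points.append((fa, hit))
    return points


def fit_z_moc(points):
    """Least-squares line through z-transformed points: (slope, intercept, r)."""
    z = NormalDist().inv_cdf
    # Rates of exactly 0 or 1 have no finite z equivalent, so they are skipped.
    pairs = [(z(fa), z(hit)) for fa, hit in points if 0 < fa < 1 and 0 < hit < 1]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5


# Points generated from two unit-variance normal distributions one standard
# deviation apart fall on a line of slope 1 after the transformation.
nd = NormalDist()
ideal = [(nd.cdf(c), nd.cdf(c + 1.0)) for c in (-1.5, -0.5, 0.0, 0.5, 1.0)]
print(fit_z_moc(ideal))
```

A fitted slope well below 1.0, such as the .55 reported here, is what this kind of fit yields when the two rating distributions differ in spread.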



The value commonly suggested as a nonparametric measure of the subject's
recognition sensitivity is the area under the subject's MOC curve (A).
This value is used as an index of d', which corresponds to the separation
of the recognition confidence distributions for words and nonwords. It
is just this separation which we were attempting to measure by the use of
d_s, described above. MOC curves were derived and A was computed for every
subject. The contention that A and d_s are theoretically equivalent was
supported by the high correlation between these measures, r=.96. The
correlation between SAT scores and A was .44, about the same as prediction
of SAT scores by d_s (r=.48).
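
As a rough illustration, the area measure might be computed from one subject's ratings as follows. The sketch assumes the trapezoidal rule applied to the five rating-scale criteria, with the curve anchored at (0, 0) and (1, 1); the report does not state how the area was obtained, and the function name is hypothetical.

```python
def area_under_moc(word_ratings, nonword_ratings, scale_max=6):
    """Area under the MOC curve (A) by the trapezoidal rule.

    Assumptions: a hit is a word rated at or above the criterion, a false
    alarm is a nonword rated at or above it, and the curve runs from
    (0, 0) through the criterion points to (1, 1).
    """
    points = [(0.0, 0.0)]
    for criterion in range(scale_max, 1, -1):
        hit = sum(r >= criterion for r in word_ratings) / len(word_ratings)
        fa = sum(r >= criterion for r in nonword_ratings) / len(nonword_ratings)
        points.append((fa, hit))
    points.append((1.0, 1.0))

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Perfect separation of words and nonwords gives the maximum area.
print(area_under_moc([6, 6, 6], [1, 1, 1]))
# Identical rating distributions put every point on the chance diagonal.
print(area_under_moc([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]))
```

Because the area depends only on the ordering of word and nonword ratings, it requires no assumption about the shape of the underlying distributions.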

It is a theoretical question whether any single measure can be
derived from confidence rating data to represent the subject's bias
in responding (commonly referred to as β or B). One measure used by
McNicol and Ryder (1971), however, indexes B as that unique point on the
subject's MOC curve (determined by linear interpolation) at which the hit
rate and false alarm rate sum to 1.0. In a situation perceived by the
subject to have equal a priori frequencies of words and nonwords, and
equal "payoffs" for hits and correct rejections, this point could be
conceptualized as a theoretical point on the confidence rating scale. This
point would represent the rating for an item for which there was maximal
uncertainty as to whether it was a word or a nonword. This value, B, was
calculated from each subject's MOC curve.

Fig. 1. Memory operating characteristic curves for the group of 200
subjects' ratings on 400 items, their ratings on the 100 item subtest,
and for the independent group of 42 subjects' ratings on the 100 item
subtest.

Rather than working from MOC curves, we have preferred to index
subject bias in a more direct way. Under an assumption of normal
distributions of equal variance, the B measure described above would be
identical to the point on the rating scale midway between the means of the
word and nonword confidence distributions. This value (henceforth to be
referred to as M, for midpoint) was also computed for every subject by
taking the simple average of the mean rating for nonwords and the mean
rating for words. The two measures, B and M, were found to correlate highly,
r=.95. If all the assumptions of signal detection theory had been met, these
measures of subject bias would ideally be independent of the sensitivity
measures (A and d_s) and would not predict the criterion scores (SAT). In
fact, B correlated -.26 and -.29 with A and d_s, respectively, and M
correlated -.18 and -.21 with A and d_s, respectively. These correlations
were all significantly different from zero (p<.05). The correlation between
B and SAT scores was -.07 (p>.05) and between M and SAT scores was -.03. Thus
it may be concluded that the bias measures, B and M, were not predictors
of SAT scores, but they were not entirely independent of the sensitivity
measures, A and d_s.
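
The two bias indices can be sketched in Python as follows. M follows directly from its definition in the text. The version of B shown here is only one possible reading: it assumes that the criterion "rated k or higher" sits at the scale boundary k - 0.5, and that B is found by interpolating linearly between adjacent boundaries to the point where the hit rate and false alarm rate sum to 1.0. That mapping onto the rating scale, and the function names, are assumptions rather than details given in the report.

```python
from statistics import mean


def midpoint_bias(word_ratings, nonword_ratings):
    """M: simple average of the mean word rating and the mean nonword rating."""
    return (mean(word_ratings) + mean(nonword_ratings)) / 2


def crossover_bias(word_ratings, nonword_ratings, scale_max=6):
    """B: scale point at which hit rate + false alarm rate equals 1.0.

    Assumption: the criterion 'rating >= k' is located at k - 0.5, and the
    crossing point is found by linear interpolation between criteria.
    """
    curve = []
    for k in range(1, scale_max + 2):
        hit = sum(r >= k for r in word_ratings) / len(word_ratings)
        fa = sum(r >= k for r in nonword_ratings) / len(nonword_ratings)
        curve.append((k - 0.5, hit + fa))
    # The sum falls from 2.0 (everything accepted) to 0.0 (nothing accepted).
    for (b0, s0), (b1, s1) in zip(curve, curve[1:]):
        if s0 >= 1.0 >= s1 and s0 != s1:
            return b0 + (s0 - 1.0) / (s0 - s1) * (b1 - b0)
    return None


# Hypothetical subject whose word and nonword ratings mirror each other
# around the middle of the scale: both indices land on the same point.
words, nonwords = [4, 5, 6], [1, 2, 3]
print(midpoint_bias(words, nonwords), crossover_bias(words, nonwords))
```

For rating distributions that are symmetric and equally spread, the two indices coincide, which matches the equivalence argued in the text.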

There are three important points to be gleaned from these results.
First, the d_s measure is theoretically and functionally equivalent to
the sensitivity measure postulated within the theory of signal detection.
Secondly, the sensitivity measures have been shown to have low
correlations (r=-.23 on the average) with the measures B and M. Insofar as B
and M index subjects' biases it can be concluded that, while the
sensitivity measures were not completely independent of subject bias, they
are, at least, only slightly affected by bias. Finally, the bias measures
did not significantly correlate with the criterion scores (r=-.05 on the
average). Therefore, the significant prediction of criterion scores by
the sensitivity measures was in no way due to, and, in fact, must have
been in spite of, the degree to which the sensitivity measures reflected
subject bias in the use of the scale.

Official SAT scores. Subsequent to the above analyses, the question
was raised as to the veracity of the subjects' reports of their criterion
scores. We were able to obtain official school records of 156 SAT and 155
HSR scores. The subjects' reported SAT scores and those provided by the
school for these 156 subjects correlated .90. It is interesting to note
that the scores reported by these 156 subjects were just slightly lower
on the average than the scores of the other 44 subjects, yet averaged
almost 19 points higher than the scores obtained from the university
records. The correlations of the d_s and A measures with the official
SAT scores were slightly higher than those with the full sample of reported
scores, .56 and .54, respectively.

Reported and official HSRs were about equal in magnitude, and correlated
.38 over the 155 subjects for whom the scores were available. An average
was taken of the correlations between reported HSR scores and the two
measures d_s and A. This average correlation was also calculated using
official HSR scores, using reported SAT scores, and using official SAT
scores, and all these average correlations are summarized in the first
column of Table 1.



Table 1

Average Prediction of Criterion Scores by the Sensitivity Measures,
d_s and A, for the 400-Item Test, the 100-Item Subtest, and the
Cross Validation of the 100-Item Subtest

                                    100 Item     Cross
                    400 Items       Subtest      Validation
SAT - reported      .46 (200)       .66 (200)    .64 (36)
SAT - official      .55 (156)       .69 (156)    .58 (18)
HSR - reported      .26 (200)       .32 (200)    .19 (40)
HSR - official      .25 (155)       .31 (155)    .08 (17)

Note: Number of cases is indicated in parentheses.

Testing effects. In laboratory tests of recognition, changes in
performance as a function of the test interval or the testing procedure
have been of considerable interest, and have been referred to as testing
effects (e.g., Underwood & Freund, 1970; Underwood, 1972). In the present
experiment, the judgments of the items on each page were systematically
balanced over positions within and between days. Thus, changes in
subject judgments over pages and days could be examined free of any
confounding by specific item groups. The mean word judgment given by all
subjects on each of the 10 pages of the test booklet ranged from 4.86 to 4.93
in no systematic or statistically significant way. In the judgment of
nonwords, however, the page effect (F=2.57, df=4,796) and the day effect
(F=5.07, df=1,199) were significant (p<.05), though their interaction was
not (F=1.70, df=4,796, p>.05). The mean judgments of nonwords on the two
days of testing were [illegible] and 2.95. Across the five pages of a test
booklet (averaged over the two days) the mean judgment given to nonwords
rose slowly but systematically from 2.88 to 2.96. Though statistically
reliable, the magnitude of these effects is so small as to seem
empirically unimportant. The measure M, being a midpoint between the word
and nonword means, also had to increase slightly over days and pages as a
result of the increase in nonword means and the relative stability of
word means. If M were considered to be a measure of subject bias, the
small testing effects in M could be interpreted as a change in subjects'
biases over days and pages resulting from a slightly but reliably
increasing tendency to guess that a nonword might be a word.

The d_s measure was calculated for every page of each subject's test
booklet. The average d_s varied from 1.44 to 1.53 in no systematic or
statistically significant way. Thus, there was no evidence for any
meaningful change in sensitivity over the testing interval.

To these data can be added the correlations of the subjects' nonword
means, word means, and d_s values calculated separately for day 1 and day 2.
These correlations were .84, .71, and .58, respectively, which, though
high, are not particularly noteworthy for reliabilities. In summary, it
is concluded that subject performance on this test was relatively stable,
and whatever systematic changes in behavior did occur over the testing
interval were small indeed.

Dichotomous judgments. In many recognition studies (e.g., Underwood,
1972) the subject's task has been to respond with a "yes" or "no," and
items have been scored as either right or wrong. It is not unreasonable
to ask whether this more simple way of responding and scoring would have
produced results comparable to or better than those which were achieved
on this test using 1 to 6 confidence ratings. In accordance with the
labelling of the points on the rating scale, it was assumed that any item
which had been rated 1, 2, or 3 would have been classified as a nonword
if the subject had been making simple dichotomous judgments. Likewise,
items rated 4, 5, or 6 presumably would have been classified as words in
such a task. All of the data for all subjects were accordingly transformed
into dichotomous judgments and scored as correct or incorrect. Several
procedures were used to evaluate these data, including simple raw scores,
correction for guessing scores, and signal-detection scores. These scores
uniformly resulted in skewed distributions and relatively low
correlations with SAT scores. It was concluded that this was not a valuable
way to proceed, and insofar as this procedure truly mimicked what would be
obtained in a "yes"-"no" test on nonwords and words, such dichotomous
responding does not provide data equal in quality to that obtained by
the confidence-rating method.
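
The transformation to dichotomous judgments can be illustrated with the Python sketch below. The cutoff between ratings 3 and 4 comes from the text; the three scoring formulas do not. The sketch assumes that the raw score is the number of correct classifications, that the correction for guessing is the conventional rights-minus-wrongs score for a two-choice test, and that the signal-detection score is d' computed as z(hit rate) minus z(false alarm rate). The function name is hypothetical.

```python
from statistics import NormalDist


def dichotomous_scores(word_ratings, nonword_ratings, cutoff=4):
    """Rescore six-point ratings as yes/no judgments (rating >= cutoff is 'yes')."""
    hits = sum(r >= cutoff for r in word_ratings)
    false_alarms = sum(r >= cutoff for r in nonword_ratings)
    total = len(word_ratings) + len(nonword_ratings)

    correct = hits + (len(nonword_ratings) - false_alarms)
    wrong = total - correct

    hit_rate = hits / len(word_ratings)
    fa_rate = false_alarms / len(nonword_ratings)
    d_prime = None
    # z is undefined for rates of exactly 0 or 1, so d' is left unset then.
    if 0 < hit_rate < 1 and 0 < fa_rate < 1:
        z = NormalDist().inv_cdf
        d_prime = z(hit_rate) - z(fa_rate)

    return {
        "raw": correct,                # assumed: number right
        "corrected": correct - wrong,  # assumed: rights minus wrongs
        "d_prime": d_prime,            # assumed: yes/no d'
    }


print(dichotomous_scores([6, 5, 4, 3], [1, 2, 3, 4]))
```

Collapsing the scale in this way discards the distinction between, say, a rating of 4 and a rating of 6, which is one plausible reason the dichotomous scores predicted SAT scores less well.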

Experiment II

The first experiment demonstrated the practicability of constructing
a vocabulary test based on absolute judgments of word recognition, and
scoring it in accordance with the methods of signal detection theory.
The purpose of the second experiment was threefold. First, the test was
to be decreased in length so that a subject might easily be given the
instrument in one session. Secondly, "bad" items were to be eliminated
so as to increase the overall predictive power. In the remainder of the
paper, the collection of items retained after decreasing the test in length
will be referred to as the subtest. This subtest was initially evaluated
by deriving scores for the subjects in Experiment I as though these were
the only items which had been judged. Third and finally, the subtest items
were to be assembled as a separate test and judged by an independent group
of subjects. This will be referred to as the cross validation.
Method 

Subtest. From the pool of 400 items, 26 nonwords and 74 words were
chosen to be used as a 100-item subtest. To evaluate this subtest, subject
protocols from Experiment I were rescored as though the subjects had rated
only these 26 nonwords and 74 words.

Three criteria were used to select the subtest items. The primary
consideration was to obtain a set of items such that the frequency
distributions of the mean judgments on the items (from Experiment I) would
be approximately normal for nonwords and for words. Accordingly,
the rating scale was divided into units of 0.5 width. From the eight
intervals along the scale beginning with 2.00 to 2.49 and ending with 5.50
to 5.99, the following numbers of words were selected for the subtest:
1, 6, 12, 18, 18, 12, 6, and 1. From the six intervals beginning with 1.50 to
1.99 and ending with 4.00 to 4.49 the following numbers of nonwords were
selected: 1, 3, 9, 9, 3, and 1. The second selection criterion concerned an
item's discriminability. The rating on an item was correlated with
reported SAT scores over all 200 subjects. Within a rating-scale interval,
those items were chosen which best discriminated among subjects
according to this index. The third criterion for selection was a high
internal consistency index, as described in the Results section of
Experiment I.
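
A selection of this kind could be sketched in Python as below. The interval quotas are those given in the text, with the second nonword quota taken to be 3 so that the nonword quotas total 26. Everything else is an assumption made for illustration: the item records, the field names, and in particular the rule of ranking items within an interval by discriminability and breaking ties by internal consistency, since the report does not say how the second and third criteria were combined.

```python
from collections import defaultdict

# Quotas per 0.5-wide interval of mean rating, as listed in the text.
WORD_QUOTAS = [1, 6, 12, 18, 18, 12, 6, 1]   # intervals from 2.00 upward
NONWORD_QUOTAS = [1, 3, 9, 9, 3, 1]          # intervals from 1.50 upward
                                             # (second value inferred from the total of 26)


def select_subtest(items, first_lower, quotas, width=0.5):
    """Pick items interval by interval until each quota is filled.

    items: list of dicts with hypothetical keys 'name', 'mean',
    'discrimination' (higher is better), and 'consistency'.
    """
    bins = defaultdict(list)
    for item in items:
        index = int((item["mean"] - first_lower) // width)
        if 0 <= index < len(quotas):
            bins[index].append(item)

    chosen = []
    for index, quota in enumerate(quotas):
        # Assumed rule: best discrimination first, consistency as tie-breaker.
        ranked = sorted(
            bins[index],
            key=lambda it: (it["discrimination"], it["consistency"]),
            reverse=True,
        )
        chosen.extend(ranked[:quota])
    return chosen


# Tiny made-up pool with a two-interval quota, for illustration only.
pool = [
    {"name": "a", "mean": 2.1, "discrimination": 0.1, "consistency": 0.2},
    {"name": "b", "mean": 2.3, "discrimination": 0.3, "consistency": 0.2},
    {"name": "c", "mean": 2.6, "discrimination": 0.2, "consistency": 0.5},
    {"name": "d", "mean": 2.9, "discrimination": 0.4, "consistency": 0.1},
    {"name": "e", "mean": 2.7, "discrimination": 0.1, "consistency": 0.9},
]
print([it["name"] for it in select_subtest(pool, 2.0, [1, 2])])
```

The symmetric quotas are what make the selected items' mean ratings approximately normal in distribution, which was the primary selection criterion.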
Cross validation. In order to cross validate the subtest, the 100
items were assembled separately into a test booklet. The items were placed
randomly onto five pages, 20 items per page. The 20 randomly ordered
items appeared in lower-case type in a single column. Except for the
lesser number of items, the test sheets were exactly as described in
Experiment I. The five test sheets were presented in the same order to
every subject. The position (from 1 to 100) of each item which appeared
in the subtest booklet is indicated in the last column of the Appendix.
Items 1 through 20 appeared on page 1, 21 through 40 on page 2, and so on.

The cross validation sample consisted of 42 subjects who took the
subtest as partial fulfillment of a course requirement at Northwestern
University. Subjects were tested in groups ranging in size from 1 to
about 25. They were told that this experiment was being done to develop
a new kind of vocabulary test, and they were given instructions on rating
the items as were subjects in Experiment I. Subjects were asked to report
their SAT and HSR scores. No subjects were lost or replaced for any
reason.
Results 

Subtest. The most obvious consequence of selecting items for the
subtest was to remove words with very high confidence ratings.
Accordingly, the mean word rating was changed from 4.90 in the full 400-item
form to 4.01 in the 100-item subtest. The standard deviation was
increased slightly from 1.53 in the original to 1.62 in the subtest. The
change in the mean judgment and standard deviation for nonwords was
slight, from 2.92 to 2.99 and from 1.30 to 1.35, respectively.
Considering the mean word and nonword judgment for each subject, the result
was the same. The standard deviation for the subjects' mean word judgments
changed from .32 to .59, whereas the standard deviation of subjects' mean
nonword judgments changed from [illegible] to .73.

Some other observations may be made to indicate the degree to which
the subtest was representative of the complete test. The subjects' scores
for the full test of 400 items and their scores on the 100-item subtest
were correlated for six measures. The correlations for the mean rating
given nonwords, the mean rating given words, and the bias measures, M
and B, were .95, .94, .98, and .96, respectively. For these four measures,
then, the subtest was highly representative of the full test. The
corresponding correlation for d_s was .77 and for A was .76. The correlations
on these sensitivity measures were obviously not as high as those for the
other measures, and this is evidence that the subtest was primarily
functioning to change slightly and differentially the estimates of the
subjects' abilities.

The group MOC curve based on the subtest is shown as the bottom
line in Figure 1. The movement of the line toward the major diagonal
indicated that the average sensitivity as measured by the 100-item
subtest was lower than that measured by the full test of 400 items. This
was the result of removing the easy words which had served to
indiscriminately raise all subjects' scores. In fact, the average As for
the full test and subtest were .83 and .69, in that order, and the average
d_s values were 1.42 and .73. The increase in the slope of the line to .76
indicated that there was less difference between the word judgment
variability and nonword judgment variability in the 100-item subtest than
had been present when all 400 items were considered, and this is in
accordance with the standard deviations of those judgments reported above.
Again, the linear fit is quite good (r = .999), though this is not
a particularly critical test of the normality assumption.

The sensitivity measures, d_s and A, correlated .98, and the bias
measures, B and M, correlated .97 on the subtest, comparable to these
same figures from the overall test (.96 and .95). The average
intercorrelation of the sensitivity measures with the bias measures had been
significantly different from zero in the full test (r = -.23), but this
correlation, r = -.05, did not differ significantly from zero in the
subtest (p > .05). Referring again to Table 1, it can be seen that the
correlation of reported and official SAT scores with the sensitivity
measures increased considerably on the subtest, from .46 to .66 (N = 200)
and from .55 to .69 (N = 156), in that order. The average correlation of
reported SATs with the bias measures (B and M) had been -.05 on the full
test, and was -.01 on the subtest. The corresponding values with regard
to the official SAT scores were -.10 and -.05.

Cross validation. With the independent group of 42 subjects taking
the subtest, the mean nonword rating, 2.98, and standard deviation, 1.46,
were very comparable to those values calculated for the original sample
of subjects (2.99 and 1.35, respectively). These values for the word
ratings were 4.04 and 1.71, also comparable to those values obtained on
the subtest with the first sample (4.01 and 1.62). The standard
deviations of the mean judgments made by a subject for nonwords, [illegible],
and for words, .44, were lower in this sample than in the original sample
(.73 and .59, respectively). For the cross validation sample the split half
reliability calculated over subjects for mean nonword judgments was .86,
and for mean word judgments was .71. (Day by day reliabilities calculated
for the 400 items in Experiment I had been .84 and .71 for nonword means
and word means.) Additionally, the correlation between the mean rating
on an item from the first sample and the mean rating for the same item
from the cross validation was .90 for the 26 nonwords and .91 for the 74
words, and these may be taken as estimates of item reliability. It is
concluded from these data that performance on the subtest was highly
comparable for the original and cross validation samples of subjects.

The comparability of use of the scale by the two groups with respect
to signal detection theory may be assessed with regard to the group MOC
curve. The "X" marks in Figure 1 represent the MOC points for the cross
validation group. The line of best fit to these points has not been drawn
in, since it would be indistinguishable from that for the original sample,
and for all practical purposes, it is apparent that the line from the
original sample serves to describe the cross validation sample as well.

Out of 42 subjects, 36 reported SAT scores, and the university
provided official scores for 18 of these. The mean of the 36 reported scores
was 607.36, with a standard deviation of 73.34. Official SAT scores
averaged over 33 points lower than the scores which were reported by these
18 subjects. The reported HSRs averaged 83.27 and had a standard deviation
of 15.69. Reported and official scores correlated only .55 for SAT and
.80 for HSR.

The two sensitivity measures, A and d_s, correlated .99 in the cross
validation group, but the .89 correlation between B and M was low in
contrast to what we had come to expect. Table 1 summarizes the prediction of
the criteria for this group. The average prediction of the reported SAT
scores by the sensitivity measures was .64 and compared favorably with the
.66 prediction in the first sample. The prediction of the official scores
was disappointingly lower, having fallen from .69 to .58, though this
difference was not significant (t = .69, df = 168, p > .05). One is tempted to
attribute this fall to the small size of the cross validation sample for
which official scores could be obtained (18). The average intercorrelation
of the sensitivity and bias measures was -.18 (p > .05), and the reported and
official SAT scores were predicted by the bias measures with average
correlations of -.02 and -.07, respectively.
Discussion 

From the results of this experiment, we conclude that a recognition
test of vocabulary scored through the use of measures derived from signal
detection theory holds distinct promise as a tool for evaluating
vocabulary skills. This test of 100 items is easily administered to the average
student in about 15 minutes. The test is easier to take than the typical
vocabulary test of the same length, which usually involves reading a word and
searching through several alternative definitions for the one which best fits.
While correction-for-guessing procedures are still a matter of theoretical
debate with regard to the usual multiple-choice format, this testing procedure
yields a separate measure of bias (or guessing tendency) as well as a
sensitivity measure. The resultant sensitivity measure correlates very acceptably
(about .60) with scores from the verbal sections of the Scholastic Aptitude
Test. It might be argued, in fact, that this correlation underestimates
the validity of the test. The present test purports to be a measure of
vocabulary, while the SAT is presumably a measure of verbal ability in a
more general sense. Certainly, vocabulary skill must contribute in large
part to the score on the SAT, and no other convenient, dependable measure
of vocabulary skill was available for use as a criterion. If this test
were to be validated against some more direct measure of vocabulary ability,
though, the estimate of its validity might be even higher.

Further, this is not meant to be the final version of a test. The 
primary purpose of this experiment was to demonstrate the feasibility of 
such an approach, not to provide a highly developed product. The items in 
the present test have been put through only one selection process, and im- 
provements in the item pool could certainly be made. In line with signal 
detection theory, for example, it might be suggested that a pool of words 
and nonwords be developed which yield more equal variabilities in judgments. 
This should increase the validity of the d measure of sensitivity. Further 
work might also be done to improve the nature of the scale. The intervals
in the rating scale have been assumed to be equal, but may be psychologically 
very different. A scaling of the intervals on the six-point scale into 
their proper psychological equivalents could reveal a transformation of either 
the judgment data, or the scale itself, which should serve to increase the 
validity of the measurement. 

This paper also lends another level of generality to signal detection
theory, a theory which is encountering widespread success in application
to recognition situations of many kinds. To the best of our knowledge,
the present techniques for calculating d_s and M are not commonly in use.
With two different samples of subjects, however, these measures were shown
to correlate highly with A and B, respectively, as they have previously
been calculated from subject MOC curves (see Green & Swets, 1966; McNicol
& Ryder, 1971). Of particular relevance is the fact that d_s and M do not
need to be estimated from MOC curves, which simplifies the computations
of these measures considerably.

Experiment III 

Up to this point, consideration has been given only to differentiating
among subjects on the basis of how well they can recognize words and
nonwords. No mention has been made of how such a process of recognition might
be occurring. Yet, a knowledge of the psychological processes involved in
differentiating between meaningful and nonmeaningful verbal stimuli would
be of considerable importance.

Several investigators have proposed mechanisms to explain how words
are recognized. McNulty (e.g., 1966) has proposed that recognition is
accomplished through the learning of partial information. After exposure to a
stimulus, the subject cannot reproduce the stimulus in its entirety, but
has retained some information about it. At the time of recognition, the
subject generates the partial information and checks it for a match against
the stimulus provided. If the partial information which the subject can
generate is entirely matched by the stimulus, the subject accepts the
stimulus and says he recognizes it. McNulty proposes that such partial
information can be structural, such as individual letters, or associative, such
as knowledge that an item was a member of some category. Since nonwords in
Experiment I were created from fragments of real words, their structures
were sound. This point of view probably would hold, then, that subjects
distinguished between words and nonwords on the basis of associative
processes, or more properly, the lack of them. That is, perhaps the subject
looked at a nonword, decided that there was nothing with which he could
consistently associate it, and thereby gave it a low rating. Words, on the
other hand, might have brought to mind familiar associations and were
accordingly accepted as words. If this is a fair representation of McNulty's
position, it might seem to predict that nonwords which produce consistent
associations might be those which are most often mistaken for words.

Underwood (e.g., 1968) has proposed that frequency is the attribute
which mediates recognition in a laboratory situation. Very simply, each
time a stimulus is perceived in a study list it accumulates an additional
frequency input. When an item is presented on a recognition test, the
subject merely checks the frequency count on the item. If it is greater than
zero, the item is recognized. A generalization of this theoretical position
with respect to words would predict that the more often an item has been
seen or heard, the more likely it is to be recognized as a word. This
received some support from data in Experiment I which revealed that for 162
words, the mean confidence rating given a word correlated .30 with the
Thorndike-Lorge "G" frequency. Nonwords, however, all should have
frequencies of zero, and the only way that nonwords could be differentially
recognized according to a frequency theory would probably be through some
consideration of relative frequencies of combinations of letters or syllables
making up the items.

Another consideration of word recognition comes from Smith and Haviland
(1972). These investigators studied the question of why perception of
words through brief tachistoscopic exposure was more accurate than
perception of nonwords. They concluded that the perceptual unit of analysis of a
word is the pronounceable English segment. For a nonword, however, the
perceptual unit of analysis is the individual letter. This might suggest
that subjects pronounce the items, and decide to identify an item as a word
or nonword on the basis of pronounceability. Though all the words in
Experiment I were purposely made at least moderately pronounceable, it might
still be expected that as pronounceability of the items increased, subjects
would be more likely to perceive them as words. Given that the items were
not easily pronounced, the Smith and Haviland view might lead to the
expectation that the subject then examined the individual letters, noting
distinctions in what has been called the orthography of the word
(Zechmeister, 1969).

Perhaps the most thorough consideration of the process of recognition
of words and nonwords comes from a series of studies by Rubenstein and his
coworkers (Rubenstein, Garfield, & Millikan, 1970; Rubenstein, Lewis, &
Rubenstein, 1971). These studies examined only the cases in which words
and nonwords were correctly distinguished, and inferences were made from
the reaction times of such recognitions as a function of certain independent
variables. Briefly, the model which they have proposed suggests that the
subject begins by segmenting ("quantizing") the word into letters and
phonemes, and recoding the phonemes into their auditory equivalents. A
first check is made on the auditory recodings (essentially pronounceability)
and an item which is not pronounceable is declared to be a nonword. If the
item passes this first check, the subject next considers the individual
letters (orthography) for acceptable English combinations. Having passed
the orthography check, the subject pays attention to lexical meaning. If
the item has meaning, it is accepted as a word. Evidence is also presented
to show that a subject's speed in responding is directly related to word
familiarity or frequency, but the authors conclude on the basis of an
experiment that meaning is a more important attribute of recognition than is
frequency. It should be noted that these investigators actually have
presented a model of temporal priorities involved in word recognition. Their
model is essentially one of a series of steps at which an item either passes
or fails. They do not directly consider the occurrence of an item which
may be held in varying shades of doubt at each of the check points. Were
such a doubtful item to occur, though we would know the temporal order of
the checks, the relative importance of each of these attributes in the final
decision would still be in question.
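
The pass-or-fail sequence just described can be written out as a short
sketch. The three checks and their order follow the summary above; the
predicate functions, the tiny lexicon, and the example strings are toy
stand-ins invented for illustration and make no claim about how
pronounceability or orthography are actually evaluated by a reader.

```python
def classify(item, is_pronounceable, has_legal_orthography, lexicon):
    """Apply the three checks in temporal order; the first failure decides.

    Returns (decision, stage at which the decision was reached).
    """
    if not is_pronounceable(item):
        return ("nonword", "pronounceability")
    if not has_legal_orthography(item):
        return ("nonword", "orthography")
    if item not in lexicon:
        return ("nonword", "meaning")
    return ("word", "meaning")

# Toy stand-ins for the checks (hypothetical, for illustration only).
VOWELS = set("aeiouy")

def toy_pronounceable(item):
    return any(ch in VOWELS for ch in item)

def toy_orthography(item):
    return "q" not in item or "qu" in item

toy_lexicon = {"table", "quiet"}

results = {
    item: classify(item, toy_pronounceable, toy_orthography, toy_lexicon)
    for item in ["brzt", "qat", "blick", "table"]
}
```

Because every check returns a plain pass or fail, the sketch shares the
limitation noted above: it has no way of representing an item held in doubt
at a check point, and so says nothing about the relative weight of each
attribute in a graded confidence judgment.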

In regard to the present experiment, this model would predict that an
item's (especially a nonword's) orthography and pronounceability might be
related to its recognition. Among words, meaning and frequency would be
expected to be important determinants of recognition.

In order to obtain some evidence relating to these various positions,
the items from Experiment I were presented to an independent group of
subjects to be scaled in relation to the attributes suggested above:
associability, frequency, orthography, and pronounceability.
Method 

All 300 words and 100 nonwords from Experiment I were used in the
present experiment. These items were divided into four groups of 100 (75 words
and 25 nonwords) by matching sets of four items, words and nonwords separately,
and randomly assigning one from each such set to each group of 100. The
criteria for matching items were mean confidence ratings from Experiment I
(the primary criterion), and the discrimination and internal consistency
measures mentioned in Experiments I and II. Each group of 100 items was
typed in lower case on a single page in four columns of 25, randomly assigned
within a column, with a blank preceding each item, and a rating scale at
the top of the page.
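
A minimal sketch of this kind of matched assignment is given below. It
matches on the primary criterion only (mean confidence rating), ignoring
the discrimination and internal consistency measures, and the function
name, seed, and evenly spaced ratings are assumptions for illustration
rather than the procedure or data actually used.

```python
import random

def assign_matched_groups(items, n_groups=4, seed=0):
    """Form matched sets of adjacent items after ranking by mean rating,
    then randomly assign one item from each set to each group."""
    rng = random.Random(seed)
    ranked = sorted(items, key=lambda item: item[1])  # (label, mean_rating)
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):
        matched_set = ranked[start:start + n_groups]
        rng.shuffle(matched_set)  # random assignment within the matched set
        for group, item in zip(groups, matched_set):
            group.append(item)
    return groups

# Hypothetical mean ratings for 300 words, spread evenly over the 1-6 scale.
word_items = [(f"word{i}", 1 + 5 * i / 299) for i in range(300)]
word_groups = assign_matched_groups(word_items)
group_means = [sum(r for _, r in g) / len(g) for g in word_groups]
```

Applying the same function to the 100 nonwords would yield 25 per group, so
that each page carries 75 words and 25 nonwords with closely similar mean
ratings.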

Independent groups of subjects provided four types of ratings on the
items. Items were rated for associability (how many other words an item
brings to mind), frequency (how often the item occurs in printed English),
orthographic distinctiveness (how unusual or outstanding the letters or
spelling of an item are), and pronounceability (how easy an item is to
pronounce). The lowest point on the rating scale represented words which
were low in associability, low in frequency, low in orthographic
distinctiveness, or hard to pronounce.

Instructions for the ratings were provided on a cover sheet and were
all patterned after the instructions for rating orthographic distinctiveness
provided by Zechmeister (1969). Subjects rated items by writing a number
from 1 to 9 on the blank beside each item. Instructions asked the subjects
to rate all the words. No mention was made that some of the items were
not words.

Four pages (of 100 items each) were to be rated for each attribute.
It was decided, however, that each subject would rate only 200 items, or
two pages. Test booklets were constructed by joining pages 1 and 2 or by
joining pages 3 and 4. Across subjects, the two possible orders for each of
the two sets of pages (1-2, 2-1, 3-4, and 4-3) were alternated to balance
progressive error. Each group of 200 items was scaled for each attribute by
an independent group of 26 subjects. Since there were 400 items and four
attributes, there was a total of 208 subjects. The subjects were drawn
from the same pool as in Experiments I and II, and were tested in groups
ranging in size from 1 to about 30.
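
The total of 208 follows from crossing the four attributes with the two
200-item booklets and testing 26 subjects in each cell. The short sketch
below lays out that arithmetic; the rule that alternates page order by
subject index is an assumed way of implementing the alternation, and all
names are illustrative.

```python
from itertools import product

ATTRIBUTES = ["associability", "frequency", "orthography", "pronounceability"]
ITEM_SETS = [("page 1", "page 2"), ("page 3", "page 4")]
SUBJECTS_PER_CELL = 26

# One independent group of subjects per attribute-by-booklet combination.
cells = list(product(ATTRIBUTES, ITEM_SETS))
total_subjects = len(cells) * SUBJECTS_PER_CELL

def booklet_order(item_set, subject_index):
    """Alternate the two page orders across successive subjects (assumed)."""
    first, second = item_set
    return (first, second) if subject_index % 2 == 0 else (second, first)
```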

As a result of clerical error, two nonwords and six words were
incorrectly typed on the rating sheets. These items have not been considered
in the analyses to follow. Four subjects were dropped for failure to com- 
plete their rating sheets. Four subjects were randomly selected and their 
data discarded to effect equally sized groups. Three subjects had clearly 
and consistently reversed the direction of the rating scale and their data 
were corrected and retained. 

According to orthodox scaling procedures, data generated by the scales
used in these experiments were clearly ordinal. Up to this point, however,
for the sake of convenience it was assumed that the points on the scale
represented true intervals as implied by the numbers 1 through 6, and
statistics were used accordingly. In Experiment III, where this threatened
to be a more serious problem because of the smaller number of subjects, several
analyses were done using medians instead of means. The consistent finding
was equivalent relative results and lowered predictability, and the use of
medians was discontinued.
Results 

Words and nonwords produced clear differences in means on all scales.
Table 2 presents the mean scaled judgment and standard deviation for the
confidence ratings obtained on the items in Experiment I, and for each of
the attribute scalings in Experiment III. The confidence ratings were
done on a six-point scale, and the scalings on a nine-point scale. The
results may be summarized by the statement that as compared to words,
nonwords were less confidently judged to be words, were less likely to
remind a subject of other words, were perceived to occur less frequently
in printed English, were more distinctive in orthography, and were less
pronounceable.

Table 2

Means and Standard Deviations of the Recognition Confidence
Judgments from Experiment I and Scaling Judgments from
Experiment III for Nonwords and Words

                        Nonwords (98)       Words (294)
                          Mean      SD        Mean      SD
Confidence                2.92     .51        4.89    1.08
Associability             2.40     .58        4.89    1.65
Frequency                 1.91     .54        4.70    2.10
Orthography               5.54     .93        4.66    1.02
Pronounceability          4.19    1.30        6.28    1.46

In Table 3 are displayed the intercorrelations of all these measures
for nonwords and words. With this number of cases, a correlation of
[illegible] is statistically different from zero (p < .01), and all the
correlations in the table pass this criterion.

Table 3

Intercorrelation of Recognition Confidence Judgments from
Experiment I and Scaling Judgments from
Experiment III for Nonwords and Words

                             Nonwords (98)                 Words (294)
                            A      F      O      P       A      F      O      P
Confidence                .59    .73   -.36    .47     .87    .86   -.53    .71
Associability (A)                .69   -.47    .68            .93   -.62    .76
Frequency (F)                          -.48    .60                  -.67    .75
Orthography (O)                                .62                         -.78
Pronounceability (P)

As a check on how reliable any regression analyses on these
data might be, the nonwords and words were ranked separately by order of
mean confidence judgment from Experiment I. Nonwords and words were
then split on an odd-even basis into two groups. The "odd" nonwords and
words were combined, as were the "even" nonwords and words, to form two
groups of items, each consisting of 49 nonwords and 147 words, and these
will be referred to as Group 1 and Group 2. Group 1 and Group 2 were
analyzed separately by a stepwise multiple regression analysis. The
mean scale values for each item on each of the four attributes were
used as predictors, and the mean confidence rating given the item in
Experiment I was used as the dependent variable.
In both regressions, values from all four scales were entered as significant
predictors. The resultant multiple correlations were .917 in Group 1 and
.925 in Group 2. The weightings obtained in Group 1 were then used to
predict confidence ratings in Group 2. Similarly, weightings obtained from
the Group 2 regression were used to predict confidence ratings in Group 1,
resulting in a double cross validation on independent samples of items.
The predicted and actual confidence ratings correlated .915 in Group 1,
and .923 in Group 2, demonstrating remarkably small shrinkage and providing
evidence for the reliability of the regression analysis results.
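
The odd-even split that precedes the double cross validation can be
sketched as follows. Only the splitting step is shown; fitting the
regression weights in one group and applying them to the other is left out.
The function name and the evenly spaced ratings are assumptions for
illustration, not the study's data.

```python
def odd_even_split(items):
    """Rank items by mean confidence rating and deal them alternately
    into an "odd" half and an "even" half."""
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    return ranked[0::2], ranked[1::2]

# Hypothetical (label, mean confidence rating) pairs.
nonword_items = [(f"nonword{i}", 1.5 + 3.0 * i / 97) for i in range(98)]
word_items = [(f"word{i}", 2.0 + 4.0 * i / 293) for i in range(294)]

odd_nonwords, even_nonwords = odd_even_split(nonword_items)
odd_words, even_words = odd_even_split(word_items)

# Words and nonwords are ranked separately, then combined within each half.
group_1 = odd_nonwords + odd_words
group_2 = even_nonwords + even_words
```

Splitting on rank in this way gives the two groups nearly identical
distributions of confidence ratings, so that any shrinkage in the
cross-validated correlations reflects instability of the weights rather
than a difference between the item samples.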

Items were then separated into a group of 98 nonwords and another
group of 294 words, and these groupings were submitted to separate
stepwise multiple regression analyses. The results for these two groupings
were not the same at all. For the nonwords, only scaled frequency
significantly predicted Experiment I confidence ratings. The multiple
correlation was, of course, the same as the simple correlation between rated
confidence and scaled frequency, namely .73 (F = 106.63, df = 1, 96, p < .01),
accounting for 53% of the variance. The F to enter the next variable
(associability) into the prediction equation was not significant (F = 3.04,
df = 1, 95, p > .05), and were this variable to have been entered, it would
have accounted for just over 1% more of the variance.
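
The figures in this paragraph can be checked against one another with the
standard relations between a correlation, the variance it accounts for, and
the corresponding F ratio. The sketch below is such a check; the function
names are illustrative, and the inputs are the rounded values printed
above, so the results agree with the reported statistics only
approximately.

```python
def f_single_predictor(r, n_items):
    """F ratio (df = 1, n_items - 2) for a single-predictor regression."""
    r_squared = r * r
    return r_squared * (n_items - 2) / (1 - r_squared)

def r_squared_gain(f_to_enter, r_squared_before, df_residual):
    """Variance added by a new predictor, recovered from its F to enter.

    Solves F = gain * df_residual / (1 - r_squared_before - gain) for gain.
    """
    return f_to_enter * (1 - r_squared_before) / (df_residual + f_to_enter)

variance_explained = 0.73 ** 2                      # scaled frequency alone
f_frequency = f_single_predictor(0.73, 98)          # df = 1, 96
gain = r_squared_gain(3.04, variance_explained, 95) # associability, df = 1, 95
```

The rounded correlation of .73 gives about 53% of the variance and an F
near 110, close to the reported 106.63 (the gap presumably reflects
rounding of the correlation), and the F to enter of 3.04 corresponds to a
gain of roughly 1.4%, in line with "just over 1%".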

For words, however, all four predictor variables significantly entered
into the prediction of Experiment I confidence ratings. The overall
multiple correlation was .89, thus accounting for 79% of the variance.
Associability, frequency, and pronounceability were entered into the
prediction first, second, and fourth, respectively, all with statistical
significance surpassing the .01 level (Fs = 906.28, 21.66, and 16.64).
Orthographic distinctiveness was entered as the third variable, significant at
the .05 level (F = [illegible].44, df = 1, 290). The standardized beta weights
for associability, frequency, orthography, and pronounceability were calculated
to be .43, .42, .19, and .21, in that order. It may be concluded, then,
that in the confidence ratings of word recognition, associability, frequency,
orthography, and pronounceability were all significant predictors, but as
seen from the order of entry into prediction, and the standardized beta
weights, there is reason to believe that the first two, associability and
frequency, were of somewhat more importance.
Discussion 

Several variables have been suggested as important to a subject in
the distinction between words and nonwords. This experiment has provided
further evidence that the confidence with which a subject recognizes a
stimulus to be a word may be related in some degree to all of these
variables: associability, frequency, orthography, and pronounceability. Since
the data were strictly correlational, no causal inference can be made.
However, the data are in line with most of the theoretical conceptions
discussed in the introduction, and perhaps best aligned with the position
of Rubenstein et al. (1971). That model postulates pronounceability and
orthography to be the temporally most important variables. But,
considering that most words would pass a check for pronounceability and orthography,
these factors would not be expected to, and, in fact, did not play as
important a role (though they were significant) as did associability and
frequency in the recognition of words. The Rubenstein et al. model predicted
associability (insofar as this is synonymous with their construct "meaning")
and frequency to be the next important factors in identifying words, in that
order. In fact, associability and frequency correlated with each other,
r = .93.

The big surprise in the experiment was the prediction of nonwords.
While it is true that by the nature of the way in which these nonwords
were constructed, all of them were pronounceable and of acceptable
orthographic structure, it would still seem that if these attributes played a
significant part in the recognition of words, they should surely influence
the confidence of recognition of nonwords. Yet, these factors had no
measurable influence in the final regression.

The associability factor added to nonword confidence of recognition
at an almost significant level. It is possible that subjects were
associating to the nonword as a unit. But, it might also be hypothesized that
the subjects had never seen the nonword before and were, therefore, more
likely associating to some portion of the nonword, as McNulty (1966) might
suggest. Possibly, some of the subjects were associating to parts of some
of the nonwords, and this was inconsistent both within and between subjects.
If this were the case, then perhaps had all the subjects reliably associated
to the same portion of the nonword or to the entire item as a unit, the
associability scale would have proven to be a significant predictor of
nonword recognition confidence.

The only significant predictor of nonword judgments was perceived
frequency of occurrence in printed English, accounting for about half the
variance. This presents some interesting questions. It is obvious, for
example, that the scaled frequencies could not be accurate estimates of
true frequencies, since the true frequencies for all these items were the
same, namely zero. Perhaps subjects were rating the frequencies of individual
letters, letter combinations, syllables, syllable combinations, or all of
these. This presents an obvious empirical question, but one for which no
data can be provided in this study. If the subjects really did estimate
frequency on the basis of some fragments of an item, and since these were
presented in a mixed list with real words, might this imply that subjects
judged the frequencies of some words in the same way, i.e., by a summing of
the frequency of some word parts? Another possibility is that subjects
followed two strategies when judging item frequencies: judging an item as
an integrated unit when it was recognized as a word (when the frequency for
the unit was perceived as greater than zero), and judging it in some
segmented manner when it was not recognized as a word (when the frequency of
the unit was perceived to be zero). It is clear that when subjects in an
experiment have been asked to judge the frequencies of words in the language,
they have been assumed to be judging the frequency of the entire unit. It
is just as clear that subjects could not have been reliably judging the
frequencies of entire units of nonwords, all of which had frequencies of
zero.

There is another interesting implication of this with regard to a
matter which was raised in the Introduction to Experiment I. It was stated
there that in having subjects judge low frequency words for frequency of
occurrence in printed English, the implicit assumption was being made that
the subjects had, in fact, seen these words before. If, however, subjects
can reliably judge (by whatever means) differences in frequency for items
that are not even real words, then such an implicit assumption may not be
necessary after all.

Appendix

These are the 400 stimuli which were used in the experiments. First are
the 100 nonwords in alphabetical order. After these are the 300 words in
alphabetical order. The following information is listed with each item: the
mean Experiment I recognition confidence rating (Mn), the standard deviation of
the ratings (SD), an internal consistency measure (r) as described in the Results
section of Experiment I, and, for those items which were selected for the subtest
(Experiment II), a number indicating ordinal position on the test (P).